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^ Abstract 

Models defined by stochastic differential equations (SDEs) allow for the representation of 
^ random variability in dynamical systems. The relevance of this class of models is growing in 

h— 5 many applied research areas and is already a standard tool to model e.g. financial, neuronal 

\Q and population growth dynamics. However inference for multidimensional SDE models is still 

very challenging, both computationally and theoretically. Approximate Bayesian computa- 
["T^ tion (ABC) allows to perform Bayesian inference for models which are sufficiently complex 

^ that the likelihood function is either analytically unavailable or computationally prohibitive 

^ ; to evaluate. By exploiting the properties of the considered ABC methods a computationally 

Cd efficient MCMC algorithm is proposed, halving the running time of our simulations. Focus is 

c/2 on the case where the SDE describes dynamics affected by measurement error, the latter be- 

ing a non-negligible source of variability (and inferential complications) for most biomedical 
'nJ" and biostatistical applications. Simulation studies for a pharmacokinetic/pharmacodynamic 

model and for stochastic chemical reactions are considered. 

in 
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1 Introduction 



Approximate Bayesian computation (ABC) is a "likelihood-free" methodology which is enjoying 



increasingly popularity as it provides a practical approach to perform inference for models that, 
due to likeUhood function intractability, would otherwise be computationally too challenging to 



be considered (Sisson and Fan 2011 Marin et al. 2012). 



The class of models under study is the one defined by diffusion processes, that is solutions to 
stochastic differential equations (SDEs). These allow for the modellization of dynamics subject to 
random fluctuations, thus providing a mean to represent non-deterministic behaviours as observed 
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in many applied areas, such as neuronal modelling, financial mathematics, population dynamics, 
pharmacokinetics/pharmacodynamics and modelling of physiological and chemical dynamics. In- 
ference for SDEs has produced a large body of literature in the last twenty years. In this work we 
are primarily concerned with the estimation of unknown parameters in SDE models given mea- 
surements sampled at discrete times; in particular we attempt at using Bayesian inference. As 
we see in a moment there are several issues affecting the success and practical implementation of 
(exact) Bayesian methodology for complex multidimensional SDE models, and we consider ABC 
methods to tackle such difficulties. 

Suppose Xt = Xt{il^) is a given (possibly multidimensional) continuous stochastic process 
representing the state of a system at time t, depending on a vector of unknown model parameters 
i/j that we wish to infer from data. Also suppose Xf is solution to a SDE (i.e. Xt is a "diffusion 
process" ) which is explicitly introduced in section [2] Depending on the application scenario the 
initial state of the system Xt^ might be known and set equal to a constant Xtg = Xq otherwise we 
consider it as an unknown parameter; in the latter case 1/' includes Xq. In general we may consider 
Xf as unobservable and only known up to some error St preventing the exact measurement of Xf 
to be attained. Therefore in the most general formulation of the problem in correspondence to a 
set of n + 1 discrete time instants to < ti < ■ ■ ■ < tn {to > 0) there is a set of "corrupted" versions 
of Xo,Xi, ...,Xn {Xi = XtJ denoted with t/o, y„, where the generic yi (= ytj is a draw 

from some probability distribution underlying the following error-model 

Y, = f{Xi,e,), 2 = 0,l,...,n (1) 

where /(■) is a known real- valued vector function and the Si are independent draws from a prob- 
ability distribution independent of Xi. 

For illustration purposes we consider the additive case f{Xi, Si) = Xi+Si. However such choice 
does not affect theoretical developments. A further non-constraining assumption is to consider £i 
as being i.i.d. with mean zero and unknown covariance matrix, e.g. diagonal covariance cr'^Idim{Yi) 
(we always consider the case of homoscedastic variances), where in realistic experimental scenarios 
the value of has to be inferred from data. These assumptions on the error are set for simplicity of 
exposition, and more complicated error structures can be considered without affecting theoretical 
considerations (e.g. correlated, state-dependent errors). Ideally we wish to make inference about 
6 = (i/j, cTg) using Bayesian methods, i.e. given y = {yo,yi, •••,2/n)"^ we want to sample from the 
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posterior density 7r{0\y) (here T denotes transposition). However for many complex models not 
only a closed form expression for TT{6\y) is unavailable, but also very general Monte Carlo Markov 
chain (MCMC) methods such as Metropolis-Hastings may fail for a number of reasons, including 
e.g. difficulties in exploring the parameter space (low mixing in the chain), multimodality in the 
posterior surface, difficulties in constructing adequate proposal densities for Xt- Considering that 
for SDE models simulated trajectories are by nature highly erratic, distance from the observed 
data might turn unacceptably high for an MCMC algorithm based on acceptance/rejections of 
proposals, even if the parameters starting values are located in the bulk of the posterior density. 
Inference for SDEs is particularly difficult when either the explicit solution of the SDE or the 
transition density of the underlying diffusion process are unavailable. For some ad hoc models or 
for sufficiently simple applications successful inferential strategies for SDE models are available 
(see the reviews in Hurn et al. (2007) and S0rensen ( 2004[ )), however in many applied areas the 
aforementioned difficulties prevent the application of SDE models, particularly in situations where 
many parameters need to be estimated over multiple dimensions. Furthermore the presence of 
measurement error is a factor complicating the inference considerably. 

In this work we propose a feasible route to approach a general class of (multidimensional) SDE 
models, by exploiting recent results in (likelihood-free) ABC methodology, circumventing the eval- 
uation of the intractable likelihood and targeting an approximation to the posterior TT{6\y). A 
computationally efficient MCMC algorithm is offered, exploiting the considered ABC methods and 
accelerating computations by 50%-60% in our experiments. Simulation results for a pharmacoki- 
netic/pharmacodynamic model and stochastic chemical reactions are considered. Likelihood-free 



approaches for Bayesian inference in SDE models have already been studied (e.g. in Golightly and 



Wilkinson (2011)), potentially resulting in exact inference (except for the necessary discretisation 



error to perform forward simulation from the SDE): here we consider a more common scenario, 
where exact inference is assumed not feasible and data and simulated values are compared via 
"summary statistics" , see section [3j Notice that the simulation machinery we consider is MCMC 
and an alternative approach using sequential Monte Carlo is implemented in the ABC-SysBio 



package (Barnes et al. , 2011). 
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2 The inferential problem for SDE models 



We are concerned with the problem of finding estimates for the vector-valued parameter = 
(i/', parametrizing an SDE model, via and an error model for observations contaminated 
with measurement error, via as- Specifically we assume that dynamics of a given system are defined 
via a d-dimensional time-inhomogeneous (Ito) stochastic differential equation for the system state 
Xt at time t >to 

dXt = fiiXt, t, xl^)dt + (T{Xt, t, 'iP)dWt, (2) 

with Xq = XtQ G M'^ a known constant Xq or an unknown random quantity with density function 
7r(a;o) (in the context of Bayesian statistics we can say that vr(a;o) is the prior density of Xq), and in 
the latter case we consider Xq as an element of 0. is a d-dimensional real-valued vector, cr(-) 
is a dxm matrix and dWt ~ A^m(0, Imdt) represents independent increments of an m-dimensional 
standard Brownian motion (A^^ is the m-dimensional multivariate normal distribution and Im is 
the m X m identity matrix). We assume that standard regularity conditions for the existence and 
uniqueness of a solution for ^ are met (0ksendal, 2003). In some applications we assume that 



Xt is not observed directly, that is we consider Xt as a latent stochastic process and a "version" 
of Xt denoted with is observed instead. For example, as mentioned in the Introduction we 
might consider the "error model" Yt = f{Xt,St) and the model object of study becomes 

dXt = ^J,{Xt,t,^jJ)dt + (T{Xt,t,'^P)dWt 

(3) 

Yt = f{Xt,et), £t ~ 7r(£t|(Tj). 

Therefore when data are measurement-error-free, i.e. have been generated by model (|2]), in- 
ference is about = ip and is based on data x = (xo,i, ...,xo,d, ...,Xi^i, ...,Xi^d, ■■■,Xn,d)'^, Xij being 
the observed value of the jth coordinate of Xt at time ti {i = 0, 1, ...,n; j = 1, ...,d). Otherwise 
inference is about = (i/?, cTg) (which might realistically include Xq G M.^) and data are avail- 
able through ^ and are denoted with y = (?/o,i> ■■■,yo,d, ■■■,yi,i, ■■■,yi,d, ■■■,yn,d)^ using an obvious 



notation. An extended notation for partially observed systems is considered in section |4.2[ All 
the concepts that follow can be indifferently applied to both cases. In the context of using the 
Bayesian paradigm to estimate 0, in section |3] ABC methodology is introduced. 
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3 Background on ABC and Likelihood-free methods 



Here we briefly review some of tlie tlieory underlying ABC which is a subset of Likelihood-free 



methodology (e.g. Sisson and Fan (2011), Marin et al. (2012)) with focus on the implementation 
of a Metropolis-Hastings algorithm. For ease of notation we use lower case letters to denote 
random variables. This section is not specific to dynamic/SDE models, and generic vectors x and 
y are used to denote (latent and observed) quantities which the reader might want to associate 
to X = (a^o, Xi, XnY and y = (j/o, yi, ynY ^ase of interpretation. 

One of the most powerful results in MCMC theory is the Metropolis-Hastings (MH) algorithm: 
MH can be used to generate a Markov chain "targeting" a given probability distribution under 



rather mild regularity conditions, see Brooks et al. (2011). It is possible to target the posterior 
distribution 'K{x,6\y) by using an arbitrary kernel g((0#, a;#), (a;', 0')) to propose a move from 
{6#,x^) to {6\x') with acceptance probability 

'n{e')'n{x'\e')'n{y\x\ e')q{{e\ x'), (0#, x#)) 



a{ie#,x#),{x',e')) 

where ratio 



1 A 



7i{e#)7T{x#\e#)7v{y\x#, e#)q{{d#, x#), {O', x')) 

n{e')n{x'\e')n{y\x', e')q{{e', x'), (%, x#)) 



1 A ratio 



(4) 



7^{e#)7T{x#\e#)n{y\x#, e#)q{{e#, x#), (6>', x')) ■ 

If the proposed move is accepted we set {x^, 0^) := {x', 6') otherwise the chain stays at {x^, 0#): 
here and in the following the notation := means substitution. In practice by generating a uniform 
draw w ~ f/(0, 1) a proposal is accepted ii u < ratio and rejected otherwise. 

For subsequent development it is convenient to specify our proposal kernel as q{{0^, x^), {x', 6')) 
u{0'\Ojf,)v{x'\0'). Generating a proposal x' from v{x'\6') might be not trivial when considering 
the problem of simulating a path for the (unavailable) SDE solution, as usually the resulting chain 



mixes poorly, see Eraker (2001) and Golightly and Wilkinson (2008) for remedies. However if we 
consider the special choice v{x'\6') = 7r{x'\0') this has the advantage of simplifying the acceptance 
probability to be likelihood-free as ^ becomes 

7rie')niy\x',e')u{0#\e') 



a{{e#,x#),{x',e')) = lA 



(5) 



nie#)n{y\x#,e#)u{e'\e#y 

Now the problem of generating a sufficiently accurate proposal (trajectory in our case) x' is 



actually a non-issue for SDE models as several approximation schemes are available (Kloeden and 



Platen, 1992) and for one-dimensional SDEs it is even possible to simulate the solution exactly 



(Beskos et al. (2006)). Therefore, if an x' drawn exactly from t[(x'\6') is not available we can at 



least produce a draw a;'^ from TTh{x'\0') where h is the stepsize used in the numerical method of 
choice (e.g. Euler-Maruyama, Milstein, Stochastic Runge-Kutta) to discretize the SDE and obtain 
a numerical solution x'f^, while 7rh{x'\6') is the density function for the corresponding distribution 
law associated with the numerical scheme. What is important to recognize is that knowledge of 
7^h{-) is not necessary as this gets simplified out as described above (for h 0), whereas the only 
relevant fact is having "somehow" generated an x' (if feasible) or an x'f^, and in the latter case 
^ would depend on h. Notice that e.g. the Euler-Maruyama scheme which is often employed for 
its simplicity converges to the exact SDE solution as — t- with strong order of convergence 0.5 



(Kloeden and Platen 1992), and therefore the approximated acceptance probability ah converges 
to a as /i — )■ 0. 

It is now clear that a feature making likelihood-free methodology attractive is that it's usually 
easier to simulate realisations from models (|2])-([3]) than to evaluate their likelihood functions. As 
previously mentioned high rejection rates are typical in Bayesian inference for SDE models, when 
simulated trajectories are not close enough to the observed y (even when the current value of 6 
lies inside the bulk of the posterior density), and in such context ABC methods turn useful, as 
motivated in next section. 



3.1 ABC MCMC methods 



An approximation to the previously discussed likelihood-free methodology is given by accepting 
proposals which are sufficiently close to y according to some choice of summary statistics, a given 
metric and a tolerance value. This approach is at the heart of approximate Bayesian computation 
(ABC). If we denote with S{-) a chosen (vector of) statistics that we apply to both x' and y to 
obtain S{x') and S{y), we accept x' if e.g. \S{x') — S{y)\ < 6 for some tolerance 5 > or more 
in general if p{S{x'), S{y)) < 6 for some metric p(-). The chain obtained via MCMC targets 
7i{6\p{S{x'), S{y)) < 6). Therefore the closeness of x' to y is measured via the closeness of S{x') 



to S{y). See Sisson and Fan (2011) and Marin et al. (2012) for a general introduction to ABC 



methods. A frequently considered approach is to approximate 7r(^/|a3,0) with 



ns{y\x,e)=psiSix),Siy)) 



lJ\Six)-Siy) 



(6) 



s \ s 

where K{-) is a smoothing kernel density centred at S{x) = S{y) and 6 takes the role of band- 
width. This way ps{S{x), S{y)) has high values when S{x) is close to S{y). 
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The reasoning above can be merged with the previous hkehhood-free considerations, partic- 
ularly with the relevance of not being in need to know 7i{x\0) (or TTh{x\6)) but only be able to 
generate a proposal x' (x'^ respectively), and using 7rs{y\x,6) in place of TT{y\x,6) to weight 
closeness of simulated trajectories to data in We consider as a basic building block for the 



subsequent developments an ABC MCMC algorithm discussed in Sisson and Fan (2011) (there 



denoted "LF-MCMC"), which is a generalization of the algorithm proposed in Marjoram et al. 



(2003), and specify it for vr^ written as in We additionally consider the case of observations 



perturbed via measurement error and the fact that 6 itself could be considered as an unknown 
quantity whose optimal value can be determined a-posteriory by examining its own Markov chain, 
as suggested in Bortot et al. ( 2007[ ) (see also section 5.2). This procedure is given in Algorithm [l] 



We recall that (i) in the limit as 5 — )■ and in addition (ii) when S{-) is sufficient for 6, algorithm 

Algorithm 1 An ABC MCMC algorithm with augmented state-space 

1. Initialization: choose or simulate 6start ^ 7r(^), simulate Xstart T^ix\0start) and Vstart ~ iy\x start, start)- 
Fix dstart > and r = 0. Starting values are {9r,Sr) = (0 start, S start) and S{ysim,r) = S{ystart)- 

At (r + l)th MCMC iteration: 

2. generate {9',S') ^ u{9,d\6r,Sr) from its proposal distribution; 

3. generate x' ^ tt{x\6') from its distribution law conditional on the 9' from step 2; generate ysim ^ '^{uW , 9') 
and calculate S{ysim)', 

A with nrnhabilltv 1 A ^ie')^{S')K{\S{y,,^)-S{y)\/5')u{e^,5Ae' ,S') /fl , A , q(„ - A\ — ( R' A' Q(„ - ^^ 

4. Wltn probaonity i A ^(Q^)T,i^s^)K(\S(y,i^,^)-S{y)\/S^)u(e' ,S'\e^.Sr) \^r+l,Or+l, '^{ysim.r+l}} \tf ,0 ,!^\ysim)) 

otherwise set {9r+l,5r+l, S{ysiyn,r+l)) ■= {9r, Sr, S{ysim.r)); 

5. increment r to r + 1 and go to step 2. 

[l] allows for inference from the exact marginal posterior iT(6\y) (we turn the attention at how to 
compensate for the typical lack of conditions (i)-(ii) in section 4.1). 

Regarding our original inferential problem of estimating parameters for an SDE model, we 
consider x = {xq,xi, ...,Xn)^ as the simulated values obtained from the (analytic or numeric) 
solution of ([2]) at times {to, ^i, ^n}- When the SDE's analytic solution is not available the values 
in X can be obtained via numerical discretization methods such as Euler-Maruyama, Milstein, 



or based on higher order Ito-Taylor approximations (see Kloeden and Platen (1992) for a com- 
prehensive review). Once a solution to ^ is obtained, when d = 1 we define the interpolated 
values at observational times to,ti,...,tn as x' = {xq,Xi, ...,Xn)'^ (or x'f^ = {x^q\ ...,x^n^)'^ for the 
approximated solution obtained with a stepsize h). For a generic value of d we concatenate the 
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interpolated values as x' = (xo,i, ...,Xo,d, ...,Xi^i, ...,Xi^d, ■■■,Xn,d)'^ , being the simulated value of 
the jth coordinate of Xt interpolated at time ti {i = 0, l,...,n; j = l,...,d). By building upon 
algorithm [T] in next section we propose a new MCMC algorithm for ABC resulting in a dramatic 
reduction in computational times whenever a specific choice for the kernel K{-) is made. 



4 MCMC acceleration via Early— Rejection 

It turns out that for some choice of the kernel K{-) the previously described MCMC algorithm 
can be significantly accelerated . The trick consists in avoiding the simulation of the SDE model 
as soon as it is known that the proposed parameter will be rejected. Under some conditions this is 
trivial to be verified and the speed-up is achieved by simply switching the order of the calculations 
in the Metropolis-Hastings algorithm: we first generate the uniform random number u ~ U{0, 1) 
which has to be simulated in order to evaluate whether to accept/reject the proposed parameter, 
then the ratio of priors and proposal densities is evaluated and when u is larger than such ratio 
we can immediately reject the proposal without simulating from the SDE model. We now proceed 
to specify a suitable choice for K{-) and then motivate how the acceleration speedup is achieved. 



Following Fearnhead and Prangle (2012) we always consider K{-) to be the uniform kernel K{z) 
returning 1 when z'^Az < c and otherwise. In our case z = \S{ysim) — S{y)\/6 and A is chosen 
to be a p X p diagonal matrix defining the relative weighting of the parameters in the loss function 



(we discuss the determination of A again in section 5.1 ). The uniform kernel is defined on a region 
z^Az bounded by a volume c which is taken to be c = where Vp = 7r~^[r(p/2)p/2]^/^ 

(here n denotes the mathematical constant vr = 3, 14...). Such c is the unique value producing a 
valid probability density function K i.e. such that the volume of the region z^Az < c equals 1, 
see 



Prangle (2011). 



Interestingly, it is possible to achieve a considerable computational acceleration (about 50-60% 
in our applications) by appropriately exploiting the binary nature of such K{-). The idea works as 
follow: by considering a uniform kernel as previously specified, the kernel K{-) in the denominator 
of the acceptance probability always equal 1, and in practice this is necessary for the algorithm 
to start. Therefore the acceptance ratio simplifies to: 
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What is mostly relevant is that, because of the binary nature of K{-), we first check whether 
u > {7i{0')7i{6')u{6r, Sr\0', 5'))/ {'K{0r)-n{5r)u{6' , 5'\6r, 5r))- if such couditiou is verified then we can 
immediately reject {0',6') (i.e. without having to simulate from the SDE model!) as in such case 
u is always larger than the ratio, regardless the value of K. In case the check above fails, then we 
are required to compute K{\S{ysim) — 'S{y)\/S') and accept the proposal ii u < ratio and reject 
otherwise. The procedure is coded in Algorithm [2] Such computational trick implies considerable 

Algorithm 2 Early-Rejection ABC MCMC 

1. Initialization: choose or simulate Ostart ~ Tr{0), simulate Xstart ~ i^{x\0start) and ystart ~ T^(y\x start, start)- Fix Sstart > 
and r = 0. Starting values are (0r,5r) = {Ostart, Sstart) and S{ysim,r) = S{ystart) such that K{\S{ystart) - S{y)\/5start) = 1- 
At (r + l)th MCMC iteration: 

2. generate {6',S') ~ u{0,5\0r,Sr) from its proposal distribution; 

3. generate w ~ (7(0, 1); 
if 

7T{e')n{S')u{er,5r\e',S') 



OJ > 



n{0r)TT{Sr)u{g',5'\0r,Sr) 



(= "ratio") 



then 



{0r+i,Sr+i, S{ysim,r+i)) ■= {Or, Sr, S{ysim,r)); > (proposal rejected) 

else generate x' ~ tt{x\0') conditionally on the 0' from step 2; generate ysim ~ ''^{y\x' , 0') and calculate S{ysim)', 
if K{\S{ys^m) - S{y)\/5') = then 

{0r+i,Sr+i, S{ysim,r+i)) ■= {Or , Sr , S {ysim,r)) > (proposal rejected) 

else if a; < ratio then 

{0r+i,Sr+i,S{ysim,r+i)) ■= {0' , S' , S{ysim)) > (proposal accepted) 

else 

{Or + l,Sr + l, S{ysim,r+\)) '■= {Or , Sr , S{y sim,r)) 

end if 
end if 

4. increment r to r + 1 and go to step 2. 



> (proposal rejected) 



benefits at zero cost, particularly for rejection-based MCMC algorithms where typically only 
a moderately low fraction of draws is accepted (about 24% under some asymptotic conditions. 



Roberts et al. (1997)) and the gain is even more evident when using ABC, which is characterised 



by low acceptance rates (e.g. Fearnhead and Prangle (2012) and Sisson and Fan (2011) reported 
results obtained with a 1% acceptance rate). As a result the possibility to avoid simulating 
from the SDE model comes with obvious computational benefits. The idea itself is not new. 



and has been considered in e.g. Beskos et al. (2006) and more recently in Solonen et al. (2012): 



in particular the latter coined the expression "early-rejection", which we employ. However to 
the best of our knowledge early-rejection has not been previously considered within the ABC 
framework. Notice that the acceptance probabilities in algorithms ll]-[2] have simpler expressions 
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when the proposal distribution u{-) is such that u{0r\6') = u{6'\0r) = g{6' — 6^) with g{-) a 
symmetric distribution. In our simulations we always use the adaptive Metropolis random walk 



with Gaussian increments proposed in Haario et al. (2001), and in such case the ratio in step 3 of 



algorithm |2] becomes {ti(6')tx{6')) / {'R{6r)Ti{8r))- To a careless reader algorithm [2] might appear as 
an unnecessarily complicated procedure to sample from a density being nothing but the joint prior 
7r(0, 5): of course this is not the case. If we were to sample from the prior we would reject when 
u is larger than the ratio in step 3 and accept otherwise, however while the rejection condition is 
consistent with algorithm [2| in our case it is sometimes possible to reject even when u is smaller 
than such ratio, namely when K{\S{ysim) — S{y)\/5') = and as such we are not targeting the 
prior. However a "high" acceptance rate implies that often (a) K{\S{ysim) — S{y)\/6') = 1 and (b) 
u < "ratio" would simultaneously hold and therefore the resulting chain would target a density 
very similar to the prior, which is something we should avoid by not allowing 6 to become too 
large. 

We now have to choose the summary statistics >S'(-). This is a problem of paramount importance 



as it has been emphasized (e.g. in Fearnhead and Prangle (2012)) that whereas the shape of the 



kernel K{-) has negligible impact on the statistical performance of ABC methods, on the opposite 
side it is essential to choose >S'(-) appropriately. 

4.1 Summary statistics 

We consider recent developments allowing to build informative summary statistics, ideally "nearly 
sufficient", although in practice we cannot quantify how close we get to achieve sufficiency. Re- 



cently Fearnhead and Prangle (2012) proposed an ABC methodology making the determination of 
summary statistics almost automatic, by exploiting a classic result holding for least-squares type 
of loss functions. That is they proved that when choosing S{y) = E{0\y), the minimum loss, 
based on inference using the ABC posterior, is achieved as 5 —t- by = E{6\S{y)). Therefore 
in the limit as 5 — t- the posterior mean from the ABC output is E{0\S{y)) = E{6\y), i.e. the 
true posterior mean. Thanks to such strong result a linear regression approach is considered in 



Fearnhead and Prangle (2012) to determine the summary statistics, that is for the jth parameter 



{j = 1, ...,p) the following linear regression model is built 

= E{0,\y) + = + (3^Hy) + 0, J = 1, -.P (7) 
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where (3^^^ is a vector of unknown coefficients and is a mean-zero random term with constant 
variance. Each of the p regression models is estimated separately by least squares and the corre- 
sponding fit + P^-'^'r]{y) provides an estimate for E{6j\y) which can then be used as a summary 
statistic, because of the optimality property of S{y) = E{6\y). 

In our applications we consider the case where ri{y) is the time-ordered sequence of measure- 
ments, e.g. for yi G we set r]{y) = {i/q^i, ...,yo,d, ...,yi,i, ...,yi,d, ...,yn,d), Uij being the observed 
value of the jth coordinate of Xt at time tj {i = 0, l,...,n; j = l,...,d). Of course alterna- 



tive choices could be considered for ri{y), particularly for large ra, e.g. change-points (lacus and 



Yoshida, 2012). Therefore, when e.g. d = 1, our summary statistic function S{-) for 6j when 
applied to y becomes + Pi^yo + (^2^1 + ••• + (^n+iVn (actually we do not need the intercept 
term as in algorithm [2] we have S{ysim) — S{y) which is independent of Pq '^). Of course when it is 
necessary to apply the same criterion on the simulated ysim, e.g. during the "training phase" (see 
section 5.1), then we consider the least squares approximation of E{d\ysim) = Po^ + P'^-'^viysim) 
instead, with rj{ysim) = {ysvm,o,ysim,i, ■■■,ysim,n)- In addition to multivariate linear regression we 



have also experimented with Lasso (Tibshirani (1996)) and found the latter approach flexible 
enough and relatively computationally cheap to be used as a "default" method. We determine the 
penality-parameter for the Lasso by cross validation, and the regression coefficients corresponding 



to the "optimal" penality (as returned by the glmnet software (Friedman et al. , 2010), available 
for both R and Matlab) are used for prediction purposes. 



4.2 Partially observed systems 

Application of ABC MCMC methodology to handle partially observed systems is straightforward. 
With "partially observed" we mean an experiment for which not all the d coordinates of the 
multidimensional process {Xt} are observed at any sampling time ti. The previously introduced 
methodology does not require modifications to handle such scenario, where in general at a given 
time ti the observed yi has dimension di < d where di is the number of observed coordinates at 
time ti, and algorithms [T]-[2] require, same as before, the simulation of a draw x' having dx n 
elements. However this time a smaller vector of artificial data ysim having the same dimensions 
as y is generated, i.e. dim(t/^jm,j) = dim(yi) = di. 
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5 Computational issues 



5.1 Determination of a training region 

Success in the application of the presented methods might be affected by the identification of a 
"training region" for 6. In practice this means that for the previously described methodology 
to work reasonably well, when knowledge about 6 is scarce, we should first implement a pre- 
MCMC simulation that we call "pilot study" (we will see that such pilot is important for the 
example in section [7| but not for section [6]). Fearnhead and Prangle (2012) suggested a schedule 
for the semi-automatic ABC: i) Perform a pilot run of ABC to determine a region of non-negligible 
posterior mass; ii) "Training" : simulate sets of parameter values and artificial data from the region 
determined in (i); iii) use the simulated sets of parameter values and artificial data to estimate the 
summary statistics; iv) run an ABC MCMC algorithm with the chosen summary statistics. Step 
(i) is optional, when from previous knowledge (e.g. literature) information about can be encoded 
into an informative prior. In any case we modify step (i) as it would require performing the whole 
ABC MCMC procedure, and this would not only increase considerably the overall computational 
burden, but it would also require the availability of summary statistics (or to choose them rather 
arbitrarily), and for stochastic dynamical models it is not clear how these could be obtained. 

What we propose as a replacement for (i) is to arbitrarily fix a vague-enough but proper 
prior tiq{6) from which to simulate parameters, e.g. an hypercube. For example when running 
algorithm [2] we simulate "many times" parameters from vro(0), say simpar times, and for each of 
these simulated sets of parameters the noise-free x and then the corresponding ysim are generated. 
We now have simpar sets of parameter (s) and simulated observations from which to compute the 



summary statistics using linear regression or other methods, as described in section 4.1[ That is 
a vector (^1, ...,9p) is obtained by fitting equation ([T]) p times. By repeating independently the 
simulation of the simpar parameters and artificial data a number of times, say simpilot times, 
at the end of such large simulation a matrix having dimensions simpilot x simpar is obtained, 
each column containing simpilot estimates for the corresponding parameter of interest. These 
should help in identifying a region where posterior means E{dj\ysim) are likely to be, given a 
linear /nonlinear model for such expectations and a large number of pseudo-data ysim- We use 
such regions to "recalibrate" the initial prior 7io{6) and obtain new priors vr(0) from which we 
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simulate parameters in steps (ii) and (iv). Notice that during the pilot real data are not used, 



and therefore this strategy to define 7r{6) does not involve any "double use of data"-issue (Berger 



2006). One further advantage resulting from running a pilot is that it helps in defining the relative 
weighting of parameters, namely the entries of the matrix A considered in section |4j marginal 
parameter variances were calculated from the output of each pilot run and the means of these 
{s^, s|, Sp} taken. A diagonal matrix A was formed with ith diagonal entry same as 
in Fearnhead and Prangle (2012). Finally, prior to starting the MCMC, a further (but shorter) 
simulation named "training" is executed again in step (ii), this time by simulating parameters from 
the determined priors vr(0), with simpilot=l and simpar at least double the value of d x n (if 
computationally feasible it should be 10 to 20 times larger than dxn) when using linear regression 
to determine S{-), and the statistics >S'(-) to be used in the MCMC step (iv) are computed on 



such set of simulated values as detailed in section [471) See the example in section |6] for additional 
details. 

The suggested schedule has shown to be very valuable to avoid numerical difficulties, as ex- 
plained later on. For example the numerical discretization of an SDK might produce negative 
trajectories for concentrations of substances, produce complex valued trajectories and all sort of 
numerical infelicities occurring when simulating an in-silico model. The pilot offers valuable help 
in these cases, helping identifying "unadmissible" draws and in some way produce a brute-force 
a-priory analysis of our model. 

5.2 Bandwidth selection via augmented-space 

The parameter determining the accuracy of inference via ABC is the bandwidth 6, whose appro- 
priate value must be chosen carefully as a trade-off between statistical accuracy (low 6) and the 
ability for the monte carlo Markov chain to mix reasonably well, i.e. to explore the support of 
the posterior distribution. Of course 6 can be chosen by trial and error, that is by identifying the 
smallest 6 producing a satisfactory enough acceptance rate. Now the problem is that, as warned in 



Bortot et al. (2007), long simulations are needed as a price to be paid in any algorithm that avoids 



likelihood calculations. Bortot et al. (2007) propose an augmented state-space approach sampling 
from the ABC posterior of (0,6). As shown in algorithm [2| we need to choose a starting value 
for 6, its prior distribution and a generation mechanism for new values of 6 at the rth iteration of 
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the MCMC step. We favour small values of 6, for statistical efficiency, while occasionally allowing 
for the generation of a larger 6 to avoid the problem of poor mixing in the chain. With these 
requirements in mind we always set for 6 a prior exponential distribution tt{6) with mean A, and 
at the rth iteration of the MCMC algorithm a proposal for 6 is generated via Metropolis random 
walk with Gaussian increments. We keep the general notation for the proposal distribution u{-), 
although in practice we always consider u{-) to be the pdf of some Gaussian distribution, and 
therefore the ratio u{0r,Sr\O',6')/u{0',6'\0r,Sr) simplifies to 1. As previously mentioned ABC 
methods normally require long simulations (in terms of number of generated draws) to approach 
stationarity, as evaluation of the likelihood function is avoided. Therefore we now need even longer 
simulations as some of the generated draws are filtered out from the simulation output, namely 
we discard all draws in {Or] Sr > 6*}, where 6* is application specific and is chosen empirically as 



described in next sections. Notice that, same as in tempered-transitions (Neal, 1996) where pro- 
posals for a (multimodal) target density /(•) are produced by transversing the support of similar 
densities from which it's easier to simulate, e.g. /^^^ where r > is a sequence of "temperatures" 
smoothing peaks and valleys on the surface of /(•). In our case using a variable 6 has the same 
goal, namely approximately sample from the (exact but unavailable) posterior corresponding to 
5 = by considering a small 6 > while occasionally using larger 5's to ease acceptance (and thus 
exploration of the posterior surface). 

In order to ease the application of the methodology introduced in this work to complex high- 
dimensional stochastic models defined via SDEs, and to offer a fairly general implementation, a 
Matlab package is in preparation and will be made freely available on the author's webpage 



www. 



maths . 1th. se/matstat/staf f /umberto/l In next two sections simulation studies are considered: 



these could be treated using exact inferential methods, e.g. particle MCMC (Andrieu et al. (2010)), 
however it's interesting to look at ABC MCMC performance for such examples. Also notice that 
sequential Monte Carlo, although often preferred to MCMC in terms of statistical efficiency (and 
faster convergence), requires the generation of several particles, resulting in heavier computations 
for the evaluation of a single proposal. For the simpler example in section |6] a comparison against 
a sequential Monte Carlo algorithm is presented. 
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6 Theophylline pharmacokinetic 



Here we consider the Theophylline drug pharmacokinetic, which has often been studied in lit- 
erature devoted to longitudinal data modelling with random parameters (mixed-effects models). 



We follow a setup as similar as possible to Pinheiro and Bates (1995) and Donnet and Samson 



(2008), although our model does not contain random parameters. We denote with Xt the level 
of Theophylline drug concentration in blood at time t. Consider the following non-authonomous 
SDE model: 

where Dose is the known drug oral dose received by a subject, is the elimination rate constant, 
Ka the absorption rate constant, CI the clearance of the drug and a the intensity of intrinsic 
stochastic noise. The experimental design for a single hypothetical subject considers nine blood 
samples taken at 15 min, 30 min, 1, 2, 3.5, 5, 7, 9 and 12 hrs after dosing. The drug oral dose 
is chosen to be 4 mg. At time to = the drug is administered and therefore at to the drug 
concentration in blood is zero, i.e. Xq = xo = 0; thus it's reasonable to assume the SDE initial 
state to be known and as such will not be estimated. The error model is assumed to be linear. 
Hi = Xi + Ei where the Ei ~ A^(0, a^) are i.i.d., i = 1, ...,9. Inference is based on data {yi, 1/2, 1/9} 
collected at times ti = 15 min, ^2 = 30 min, etc. Parameters of interest are {Ke, Ka, CI, a, a^) 
however in practice to prevent the parameters from taking unrealistic negative values their natural 
logarithm is considered, thus we set 6 = (log -ft^e, log ii'a, log C/, log cr, log cr^). 



Equation (^ is linear in the state variable and a solution is readily available (see Kloeden 



and Platen (1992)), therefore data have been simulated exactly. For data generation we used 



(log i^Te, log i^a, log C/, log (T, log (T^) = (-2.52, 0.40, -3.22, log 7(12, log VoT) and interpolated the 
obtained trajectory at sampling times to get n = 9 values for process {Xi} which we perturbed 
via the error model with cr^ = 0.1 and obtain the ?/j's. However in order to check the performance 
of our inferential algorithm in a typical scenario where closed-form solutions are unavailable, in 
the several steps of ABC we always approximate the SDE solution via Euler-Maruyama on a fine 
time-grid obtained by dividing each interval [ti,ti+i] in 20 sub-intervals of equal length, same as 
in 



Donnet and Samson (2008). 
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ABC training with multivariate linear regression 



The pilot study described in section is executed by setting Gaussian priors ttq on parame- 
ters, namely logK^ ~ N{-3,2'^), log Ka ~ N{0.2,1), \ogCl ~ A^(-2.5,l), log a ~ N{-1,0.5'^), 
logcJe ~ A^(— 1.5, 0.3^) where N{a, b) denotes the Gaussian distribution with mean a and variance 
b. During the pilot simpar = ?2x 30 = 9x 30 = 270 draws are generated for each parameter 
from the priors above (that is, in order for the linear regression to produce informative predic- 
tions we create a set of simulated parameters which is 30 times larger than the number n of 
columns/predictors in the "design matrix" specified below) and, conditionally on such draws, the 
corresponding 270 vectors x are generated and finally the 270 ysim, where each ysim has the 
same length as the actual set of observations. Then for each of the five parameters of interest a 
multivariate linear regression model is fitted separately, with response vector and design matrix 
respectively given by 
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where 9j is the sth draw from the prior of the jth element of {s = 1,...,270) and yfj^.i is 
the ith value from the ysim generated in the sth simulation, i = 1,...,9. Using the notation 
introduced in equation ([T]) we denote with (3^, (31, the obtained least squares estimates from 
linear regression, from which we obtain an estimate of 9j according to ([T]) (separately for each 
j). The procedure just described is iterated independently for simpilot = 200 times, each time 
generating draws from the prior ttq, the corresponding artificial observations and finally predicting 
the value of 6j by linear regression. The resulting simpilot vectors of predictions for all the 
components of 6 are collected in a matrix having dimensions simpilot x p which is the result of 



the pilot study. The pilot also produced the diagonal matrix A, as described in section 5.1 its 
entries are [0.60, 1.02, 1.27, 4.25, 11.33]. 

By examining each column of the simpilot xp matrix we obtain information to define the cor- 
responding prior irlOj) that is used in the "training phase" and in the the ABC MCMC procedure. 
For example histograms for each each column could be produced, the one for parameter log Kf, com- 
ing from the first column of such matrix, the one for logA^a from the second column etc. We there- 
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fore select the following priors to be used in the ABC MCMC: logi^e ~ A^(-2.4, 0.9^), logi^^ ~ 
Ar(0.33, 0.552), _ A^(_3^o.72), loga ~ A^(-l.l, O.I52) and loga^ ~ A^(-1.5, O.O?^). 



ABC MCMC with multivariate linear regression and comparison with particle MCMC 

Prior to starting the MCMC values for S{-) must be obtained: we use the prior densities de- 
rived from the pilot to simulate once (simpilot = 1) simpar = 100 x n = 900 parameters and 
corresponding artificial observation, then use model ([T]) to fit separately each set of simulated 
parameters 6j . The prior densities selected from the pilot are used in algorithm [2] and their means 
provide a starting value for 6. The MCMC procedure is run for 3,000,000 iterations by assigning 
to the bandwidth 6 an exponential prior with mean A = 0.1 and imposing a maximum allowed 
bandwidth of 0.6 (this is not the 6*, which we always determine a-posteriory from the results of 
the MCMC procedure, but is simply a threshold we decided to impose on the proposed values 
for S generated via Metropolis random walk). By using the adaptive Metropolis random walk of 



Haario et al. (2001) we achieve a 20% acceptance rate. Notice that in ABC a high acceptance 



rate perse is not an indicator of good algorithmic performance, as it is always possible to increase 



the acceptance rate by enlarging S, for example Fearnhead and Prangle (2012) obtained good 
results with a 1% acceptance rate. Of the 3,000,000 generated draws every 150th draw is retained 
for inferential purposes (i.e. we used a thinning equal to 150), this is either to break the strong 



correlation between subsequent values, which is typical in ABC (Bortot et al. , 2007), and to save 
computer memory. From the remaining 20,000 draws the first 666 are discarded (corresponding to 
a burn-in of 100,000 iterations), and we are left with about 19,300 draws to consider. In the spirit 



of Bortot et al. (2007) we plot the posterior means and 95% confidence bands for the chain of each 
parameter in dependence of the bandwidth 6, see Figure [TJ The purpose of these plots is to find 
a region {6; 6 < 6*} where the parameter posterior means are reasonably smooth and then filter 
out all draws generated from values of 6 outside such region, i.e. consider for the final results only 
those draws {^r}r=i,2,... having {6r]6r < 6*}. Too small values for 6* would provide unreliable 
results for the posterior means as (i) these would result very variable (e.g. for 6 < 0.2, see e.g. the 
panel for logi^"a), by being computed on a very limited number of MCMC draws or (ii) because for 
very small 5's the chain does not mix well (this is discussed further in section [T]), while too large 
values would return statistically biased results. Therefore a trade-off is necessary, namely select a 
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Figure 1: Theophylline example: a subset of marginal posterior means from ABC MCMC [±2 SD] vs bandwidth. 



small 5* while retaining a long enough sequence of draws to allow accurate posterior inference. It 
is useful that we can make such choice retrospectively although this implies the need to produce 
very long MCMC sequences to balance such filtering procedure, especially when considering the 
necessity for a hard thinning in presence of non-negligible autocorrelation. By choosing 5* = 0.3 
we are left with about 12,100 draws showing negligible auto-correlation. Finally the posterior 
means from the selected 12,100 draws and 95% central posterior intervals are given in Table [l] and 
results seem encouraging. In terms of computational time it required about 1.4 hrs to run the 
MCMC part of algorithm |2| that is generating the 3,000,000 draws, using a Matlab R2011a code 
running on a Intel Core 17-2600 CPU 3.40 GhZ with 4 Gb RAM. Using algorithm [T] it required 
about 3.5 hrs, therefore on this example our early-rejection approach produced a computational 
acceleration of about 60%. We further explored possibilities for improvement by not performing 
the pilot procedure, that is we used the initial priors tto{0) when running the "training" phase (for 
the determination of summary statistics) and algorithm [2] results are in Table [l| Apparently since 
the 7ro(0) are enough informative, the restriction of the simulation support for the training phase, 
when switching to 7^(0) is not beneficial for the inferential results. When the pilot is not used 
we can notice some improvement in this example, in terms of point estimation, at the expense of 
wider posterior intervals. In section [7] for a more complex example with vague priors it will be 
evident how the pilot results essential. 

In order to compare our results against non-approximate (up to Euler-Maruyama discretisa- 
tion error) Bayesian inference we considered the particle marginal Metropolis-Hastings (PMMH) 



methodology originally proposed in Andrieu et al. (2010). Here the particle weights are computed 



according to Golightly and Wilkinson (2011), the latter giving a likelihood-free approach using 
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True parameters 
Method: 

ABC with lin. regr. 
ABC with lin. regr.* 
ABC with lasso. 
ABC with lasso* 
PMMH 



= 0.080 



0.098 [0.037, 0.212] 
0.087 [0.018, 0.485] 
0.089 [0.034, 0.192] 
0.110 [0.028, 0.335] 
0.096 [0.033, 0.208] 



Ka = 1.492 



1.557 [0.643, 3.103] 
0.962 [0.140, 4.005] 
1.526 [0.858, 2.553] 
1.465 [0.224, 5.017] 
1.499 [1.027, 2.112] 



CI = 0.040 



0.054 [0.030, 0.095] 
0.048 [0.018, 0.094] 
0.053 [0.029, 0.100] 
0.054 [0.025, 0.096] 
0.034 [0.013, 0.069] 



cr = 0.450 



0.330 [0.251, 0.437] 
0.359 [0.147, 0.825] 
0.329 [0.283, 0.384] 
0.347 [0.140, 0.819] 
0.793 [0.422, 1.212] 



C7e = 0.316 



0.223 [0.195, 0.253] 
0.222 [0.129, 0.397] 
0.223 [0.207, 0.241] 
0.223 [0.130, 0.378] 
0.266 [0.140, 0.480] 



Table 1: Theophylline example: posterior means from the JVICIVIC output and 95% central posterior intervals when 
ABC statistics are computed via multivariate linear regression and lasso. Asterisks denote results obtained without 
using the pilot to determine priors. Last line reports estimates obtained with particle marginal IVIetropolis-Hastings. 

diffusion bridges wliicli is suitable for inference in SDE models. We run the Golightly-Wilkinson's 
PMMH algorithm for 500,000 iterations as it quickly reaches stationarity and used 100 particles. 
Forward simulation of SDE paths is performed using the Euler-Maruyama scheme on a time-grid 
having 20 intermediate time-steps between sampling times tj and tj+i. In PMMH we used the 
priors 7ro(^). By looking at Table [l] it's interesting to notice that, with the exception of Ka and 
(Te, results are not uniformly improved with respect to ABC; actually the second row in Table [T] 
in some cases reports better [K^, cr) or comparable point estimates {CI). 

Results obtained with Lasso 



We now consider results obtained when using lasso ( [Tibshiranl (1996)) for the estimation of 



ABC summary statistics. The prior densities vr(^j) specified using the results from the pilot 
are centred around roughly the same means of priors obtained from linear regression, however 
for lasso standard deviations resulted smaller: logi^e ~ A^(— 2.4, 0.7^), log Ka ~ A^(0.31, 0.3^), 
logCl ~ A^(-2.96,0.55), loga ~ iV(-l.ll, O.OS^) and loga^ ~ iV(-1.5, 0.042). rj.^^ means from 
such priors are used as starting values for in the MCMC procedure. By inspecting plots similar 
to those in Figure [l] (not reported) only the draws having bandwidth smaller than 5* =0.4 are 
retained and the remaining 34,000 draws used for inferential purposes (a thinning of 50 was enough 
to obtain near-independence), see Table [l] In this application lasso does not seem to produce 
point estimates markedly different from linear regression, except for an improved estimation of 
Kg. Kernel smoothing approximations to the obtained marginal posteriors for some parameters 
are in Figure [2] these are compared against the corresponding priors 7c{0j) and the posteriors 
from PMMH. For log K^. the ABC and PMMH posteriors are almost overlapping and difficult to 
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distinguish. 




(a)logi^e (b)logifa (c)logCT 

Figure 2: Theophylline example: marginal posterior densities from the ABC MCMC output (solid lines) where 
summary statistics have been obtained via lasso; prior densities 7r(0j) (dashed lines) and marginal posteriors from 
PMMH (triangles connected with lines). True parameter values are marked with vertical lines. 



In conclusion, having obtained sufficiently good results with the considered methodology is re- 
markable, given the several approximations involved, namely the use of numerical approximations 
to the SDE solution, the use of non sufficient statistics S{-), a value for 6 which is not zero and 
the limited amount of data n subject to measurement error. 



7 Stochastic kinetic networks 



Here we consider a fairly more complex simulation study, involving a multidimensional SDE and 
using several experimental setups, namely fully and partially observed systems, with and without 



measurement error. We study a model proposed in Golightly and Wilkinson (2010) for biochemical 
networks. Consider the set of reactions {-Ri,i?2, ■■■yRs} defined by 



i?i : DNA + Pa ^ DNA ■ Pa 
Pa : DNA DNA + RN A 
Ps : 2P ^ Pa 
P7 : RNA -> 



Pa : DNA ■ Pa ^ DNA + Pa 
P4 ; RNA RNA + P 
Pe : Pa ^ 2P 
Pg : P ^ 0. 



These reactions represent a simplified model for prokaryotic auto-regulation based on the mecha- 
nism of dimers of a protein coded for by a gene repressing its own transcription. The "reactants" 
and "products" in a specific reaction in a model can be represented via the stoichiometry matrix, 
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including the reactions that a specific "species" {RNA, P, P2, DNA ■ P2, DNA) is part of, and 
whether the species is a reactant or product in that reaction. 

Consider the time evolution of the system as a Markov process with state Xt-, where Xt 
contains the number of molecules (i.e. non-negative integers) of each species at time t, i.e. Xt = 
{RNA, P, P2, DNA-P2, DNA)^. In a stoichiometry matrix reactants appear as negative values and 
products appear as positive values. However in practice for many biochemical network models the 
presence of conservation laws lead to rank degeneracy of the stoichiometry matrix S and therefore 
redundant species should be eliminated from the model prior to conducting inference. In our case 
there is just one conservation law, DNA ■ P2 + DNA = k, where k is the number of copies of this 
gene in the genome, which is reasonable to assume to be known. We can simply remove DNA ■ P2 
from the model, replacing any occurrences of DNA ■ P2 with k — DNA. Therefore we have the 
following stoichiometry matrix 

/oOlOO 0-1 o\ 
1 -2 2 -1 
-110 1-10 
\-l 1 / 

Now assume associated to each reaction i a rate constant q {i = 1,2, ...,8) and a "propensity 
function" hi{Xt, q), such that hi{Xt, Ci)dt is the probability of a type i reaction occurring in the 
time interval (t, t + dt\. A continuous time (diffusion) approximation to the discrete process Xt 
can be written as 

dXt = Sh{Xt, c)dt + S^ydiag{h{Xt,c))dWt (9) 

where Xt = {RNA, P, P2, DNAf, c = (ci , C2, Cg) and we take h{Xt, c) = {ciDNA x P2, C2{k — 
DNA),C3DNA,C4RNA,C5P{P - 1)/2,cgP2,C7RNA,csP)^ (notice that here x denotes multipli- 
cation), dWt = {dWt,i, ...,dWt,8)^ and the dWt,i are i.i.d. dWt,i ~ N{0,dt), i = 1,2, ...,8. Object 
of the inference is the estimation of c. 



Model (|9]) could have a more elegant formulation, see e.g. Golightly and Wilkinson (2010), 
however the proposed version solves some numerical issues related to taking the square root of 
non-positive definite matrices (arising when simulated trajectories turn negative for some value of 
c). However even when taking such precautions we are still not guaranteed that our state vector 
will stay non-negative for all t, hence we take absolute values diag{\h{Xt,c)\) under the square 
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root, same as in |Higham| ( [2008| . Notice that Q has the minor disadvantage of requiring a vector 
dWt having dimension larger than Xt, to preserve dimensionahty, i.e. we need to consider an 
8— dimensional vector of Brownian increments whereas dim(Xt) = 5, thus introducing unnecessary 
redundancy in the problem formulation. 

During the pilot phase we always discard simulated trajectories containing negative values, 
together with the corresponding parameter values. That is we do not record such trajecto- 
ries/parameters nor use them for any computation, but instead simulate parameters until an 
"admissible" (real valued, nonnegative) trajectory is produced. Of course such strategy increases 
the computational time considerably, however it has to be considered for the pilot phase only, as 
during the MCMC phase parameters corresponding to unadmissible trajectories get automatically 
rejected by the Metropolis-Hastings acceptance criterion. We used the same setup as in [Golightly 



and Wilkinson (2010), except where modifications were required, as discussed later on. In all exper- 



iments data were produced with the following : (ci, C2, C3, C4, C5, cg, cy, cs) = (0.1, 0.7, 0.35, 0.2, 0.1, 
0.9,0.3,0.1) and k is assumed to be known and set to k = 10. The starting value of the sys- 
tem is Xq = {RNAQ,PQ,P2fl,DNAQ) = (8,8,8,5). Since the c/s represent rates and are thus 
non-negative values the mathematical problem is reformulated in terms of log cj . We consider sev- 
eral scenarios, namely fully observed systems with and without measurement error and partially 
observed systems with measurement error. For all simulation designs data are generated via the 



"Gillespie algorithm" (Gillespie, 1977) on the time interval [0, 49], to ensure exact simulation of the 
biochemical system defined via the reactions {-Ri, Rs} and the constant rates cj specified above. 
Then the simulated Xt (non-negative integers representing number of molecules) are sampled at 
50 integer times {0, 1, 2, 49} and such {Xt-}i represent our data, unless Gaussian measurement 
error is added, and in the latter case data are denoted with {yti}i- However we conventionally 
denote data with {yti}i in all cases, to simplify the exposition. The dimension of the measured 
yi G R'^' at a given sampling time is rfj < 4, with equality holding for all i when considering 
a fully observed system. We use the Gillespie algorithm for data generation, but not when im- 
plementing our inferential method as it would result computationally costly; instead trajectories 
from the SDE are numerically generated via Euler-Maruyama using a constant stepsize h = 0.1. 



22 



Fully observed system 

We consider two different setup for the fully observed scenario, where all the coordinates of Xt are 
observed at the same set of time-points: in the first setup data are generated without measurement 
error, i.e. Eij = deterministically for every i and j {i = 0, l,...,n, j = l,...,d) and we denote 
the obtained data with Vi, (ii) in the second setup data are perturbed with measurement error 
Eij ~ A^(0, 2) independently for every i and for each coordinate j of the state vector Xf. In (ii) 
we assume the variance cr^ = 2 to be known and denote obtained data with We consider 
Vi first: the vector parameter object of our inference is = (logci, logC2, log Cs). The pilot 
consists of simpilot=100 iterations with simpar=200 for each simpilot iteration. Consider that 
here n x d = 200 is the overall number of recorded values from the 4-dimensional process being 
observed on the set of = 50 sampling times for each dimension. Therefore ideally simpar 
should be larger, as regression of artificial data on simulated parameters should produce a fit 
able to predict accurately the summary statistics for the actual data y. However the severe 
computational issues mentioned in section [7] do not allow for a large pilot study. Fortunately lasso 
is well known to be able to accommodate setups often denoted in literature with 'W ^ p" , i.e. 
responses having size smaller than the number of predictors. 

In the pilot we considered priors ttq of the type logCj ~ f/(— 3,0) for every j. A pilot study 
via Lasso produced a matrix whose columns are used to determine the prior densities to be used 
in algorithm [2] with the same strategies and considerations made for the Theophylline exam- 
ple. We thus defined the following priors 7r(0): logci ~ A^(— 2.6, 0.15^), logC2 ~ A^(— 0.6, 0.4^), 
logcs ~ A^(-1.5,0.42), logC4 ~ iV(-1.8,0.42), logcg ~ A^(-2.4, 0.1^), logcg ~ N{-1,0.6'^), 
logC7 ~ A^(— 1.85, 0.2^), logcg ~ A^(— 1.8, 0.3^). The diagonal matrix of weights A has entries 
[9.36, 3.48, 1.55, 1.72, 5.88, 1.81, 2.13, 1.78]. Prior to starting the MCMC we need to compute the 
statistics S{-) with Lasso, and this is accomplished using simpar=25x200=5,000 simulations. 
Algorithm [2] is executed for 3 million iterations, with a thinning of 50 values. We set for the 
bandwidth an exponential prior 6 ~ Exp(0.2) and a starting value equal to 0.2 (and set to 1 the 
maximum allowed value for 6 as generated by Metropolis random walk), resulting in an accep- 
tance rate of about 27% using adaptive Metropolis random walk. A burn-in of 250,000 draws 
(corresponding to 5,000 draws discarded after thinning) is considered and results are in Table |2} 
For Vi an analysis of the bandwidth effect on the posterior means is considered in Figure [3} and 
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for illustration purposes only plots pertaining to logC2 and logcg are reported, the ones for the 
remaining parameters not giving any additional information. 
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Table 2: Stochastic networks example: posterior means from the IVICIVIC output (first Une) and 95% central 
posterior intervals (second line). Depending on the simulation assumptions a ( — ) means that for the given data 
set the parameter is not part of the model or has been treated as a known constant. See main text for details. 

Variation in the posterior means are evident within the range allowed for 6 during the simu- 
lation, i.e. in the interval [0, 1]. Although theoretically we could conduct inference based on a 
small value for 6, in practice it is not safe to consider only those {^r}r corresponding to very small 
6 values, for the reason that we are going to explain. Although small 5's theoretically guarantee 
best results in terms of statistical accuracy, in practice they imply a reduced mixing. It is impor- 
tant to make use of plots of the type given in Figures [I]-[3] to detect sharp increase/decrease in 
the posterior means. We want results corresponding to "small"' S's but not only those, as they 
result often associated with low-mixing regions and, as such, regions where the exploration of the 
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(a) logC2 vs bandwidth (b) logcg vs bandwidth 

Figure 3: Stochastic networks example using Vi: posterior means from MCMC [±2 SD] vs bandwidth. 



posterior distribution is poor. We therefore filter the chain and keep the draws corresponding to 
relatively small (5's (for statistical precision) while allowing for larger S's as long as the posterior 
means look stable for varying 6: we ultimately select draws having 6 < 0.3. This provides us with 
about 42,200 draws which we further thin by a factor 6, and the remaining ~ 1, 890 draws used for 
posterior inference. Results for the relevant parameters are given in Table |2} Notice that, same as 
in Golightly and Wilkinson (2010), the estimates for C2 and ce are far from being excellent, however 



we are able to recover the ratio Ci/c2 which represents "propensity for the reversible repression 
reactions". Instead the "propensity for the forward reactions" Cs/cg is badly estimated because of 
the poor estimate for cg. A reassuring result on the efficacy of the method is the tightness of the 
posterior intervals for several parameters, considering that the initial (pre-pilot) priors for Cj have 
support in [0.05, 1]. 

Using the same hardware as in the previous example it requires 5.9 hrs to run the early- 
rejection algorithm |2} that is generating the 3,000,000 draws with a Matlab implementation 



(Golightly and Wilkinson (2011) using PMMH on a set of 100 observations required about 46 hrs 
of computation, and thus about 92 hrs for 200 observations using a C program). For data V2, 
where observations are affected by Gaussian zero-mean measurement error with known variance 
CT^ = 2, we followed a similar procedure. No striking differences emerge with respect to the results 
obtained on Vi, which is a result in itself, as the method is able to filter out the measurement 
error variability by returning estimates which are similar to those obtained with measurement- 
error- free data. However this is not really surprising, given that as = 1.41 implies relatively small 
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perturbations. 

A partially observed system 

Here we consider a further data set denoted with P3 having the DNA coordinate unobserved. 
That is we still consider Xt = [RN At, Pt, P2,t, DNAt)^ but only the first three coordinates are 
measured (at the same 50 integer times as previously described) and therefore at any sampling time 
t = tiwe have measurements for {RNAi, Pi, -P2,i)"^, that is yi G M.^ and a total of = 50 x 3 = 150 
observations is available. We assume that at any ti each coordinate of {RNA, P, P2)f is perturbed 
with independent homoscedastic noise, yi = {RNAi, Pi, P2,iY + with Sj ~ ^^"3(0, cr^Is), however 
this time ae is considered unknown and has to be estimated. Also, since DNA is not measured, 
in order to allow simulations from such coordinate we need to estimate its starting state DNAq 
and its pre-pilot prior ttq is logDNAQ ~ f/(l,2.1). We considered 6 ~ Exp{0.1). As expected a 
partially observed system requires the simulation of long chains before some apparent stationarity 
is approached: in this case we generated 20,000,000 draws with a thinning of 100. A burn-in 
of 25,000 draws (corresponding to 2.5 million draws in the non-thinned chain) is discarded and 
remaining draws are filtered by retaining those corresponding to 6 < 0.25, see Table [2] for results. 
Interestingly, point estimates do not result very different from the previous attempts with Pi and 
T>2, despite the increased uncertainty in the simulation setup. The initial state DNAq is correctly 
identified but residual variation is not. 



8 Summary 

We considered approximate Bayesian computation (ABC) to perform inference for complex, pos- 
sibly multidimensional, partially observed stochastic dynamical systems subject to measurement 
error. A Metropolis-Hastings algorithm for ABC is proposed, reducing the computational effort 
by 50-60% in our experiments, therefore making the application of the considered methodology 
feasible. The setup can be largely automated, thanks to recent results on determination of in- 



formative summary statistics as proposed in Fearnhead and Prangle (2012). As shown in the 



simulation studies, results are encouraging. It would be unreasonable to expect to attain the same 
level of statistical efficiency provided by exact (non-ABC) Bayesian methodology or likelihood 
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function maximization. On the other hand "exact" methods (or Monte Carlo approximations 
thereof) are rarely feasible under complex multidimensional scenarios where the computational 
requirements are still highly demanding. However, using recent results from ABC literature we 
showed that a flexible methodology is available and expect to provide a Matlab package to fit 
a large class of SDE models at www. maths . 1th. se/matstat/ staff /umberto/t This work should 



result particularly useful to modellers dealing with multidimensional stochastic systems, rather 
than experimenting with one-dimensional models without measurement error for which a number 



of theoretical and computational results are available (see the review papers by S0rensen (2004) 



and Hurn et al. (2007)). On the other side we attempted at performing inference for multidimen- 
sional SDE models incorporating measurement error, which is an unavoidable (and non-trivial) 
task to seriously study increasingly large models for real world applications. 
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